Improving Word Embeddings for Low Frequency Words by Pseudo Contexts
Authors
Abstract
This paper investigates the relation between word semantic density and word frequency. A word average similarity based on distributed representations is defined as the measure of word semantic density. We find that the average similarities of low-frequency words are consistently higher than those of high-frequency words, and that the average similarity stabilizes once the frequency approaches roughly 400. The finding holds under changes in training-corpus size, the dimension of the distributed representations, and the number of negative samples in the skip-gram model, and it also holds across 17 different languages. Based on this finding, we propose a pseudo-context skip-gram model, which makes use of the context words of a target word's semantic nearest neighbors. Experimental results show that our model achieves significant performance improvements in both word similarity and analogy tasks.
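The abstract leaves the exact definitions to the paper, but both ideas can be sketched. Below is a minimal, hypothetical NumPy sketch, assuming that "average similarity" means the mean cosine similarity between a word's vector and every other vocabulary vector, and that "pseudo contexts" are context words borrowed from a low-frequency target's nearest neighbors; the threshold of 400 echoes the frequency at which the paper reports the measure stabilizing, while k=5 is an arbitrary choice, not a value from the paper.

```python
import numpy as np
from collections import defaultdict

def average_similarities(emb):
    """For every word, the mean cosine similarity to all other words.
    One plausible reading of the paper's 'word average similarity'
    (semantic density) measure."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T              # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)           # exclude self-similarity
    return sims.sum(axis=1) / (emb.shape[0] - 1)

def nearest_neighbors(emb, word, k=5):
    """Indices of the k words closest to `word` by cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[word]
    sims[word] = -np.inf                  # never return the word itself
    return np.argsort(sims)[-k:][::-1]

def pseudo_context_pairs(pairs, emb, freqs, threshold=400, k=5):
    """Augment skip-gram (target, context) pairs: for targets rarer
    than `threshold`, also emit the contexts observed around their
    k nearest semantic neighbors ('pseudo contexts')."""
    contexts = defaultdict(list)
    for t, c in pairs:
        contexts[t].append(c)
    augmented = list(pairs)
    for t in {t for t, _ in pairs}:
        if freqs[t] < threshold:
            for nb in nearest_neighbors(emb, t, k):
                augmented.extend((t, c) for c in contexts.get(nb, []))
    return augmented
```

Under this reading, the augmented pairs would simply be fed to an otherwise standard skip-gram trainer with negative sampling, so low-frequency targets see more (borrowed) context evidence.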
Similar Resources
Understanding and Improving Multi-Sense Word Embeddings via Extended Robust Principal Component Analysis
Unsupervised learned representations of polysemous words generate a large number of pseudo multi-senses, since unsupervised methods are overly sensitive to contextual variations. In this paper, we address pseudo multi-sense detection for word embeddings via dimensionality reduction of sense pairs. We propose a novel principal component analysis method, termed ExRPCA, designed to detect both pseudo multi-sense...
Identity-sensitive Word Embedding through Heterogeneous Networks
Most existing word embedding approaches do not distinguish the same word in different contexts and therefore ignore its contextual meanings. As a result, the learned embedding of such a word is usually a mixture of multiple meanings. In this paper, we acknowledge multiple identities of the same word in different contexts and learn identity-sensitive word embeddings. Based on an identity-...
Substitute Based SCODE Word Embeddings in Supervised NLP Tasks
We analyze a word embedding method in supervised tasks. It maps words onto a sphere such that words co-occurring in similar contexts lie close together. The similarity of contexts is measured by the distribution of substitutes that can fill them. We compared word embeddings, including more recent representations (Huang et al., 2012; Mikolov et al., 2013), in Named Entity Recognition (NER), Chunking, and Dep...
Improving Document Ranking with Dual Word Embeddings
This paper investigates the popular neural word embedding method Word2vec as a source of evidence in document ranking. In contrast to NLP applications of word2vec, which tend to use only the input embeddings, we retain both the input and the output embeddings, allowing us to calculate a different word similarity that may be more suitable for document ranking. We map the query words into the inp...
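As a rough illustration of the dual-embedding idea described in this snippet, the sketch below scores a document against a query using both embedding matrices: query words pass through the input (IN) embeddings and document words through the output (OUT) embeddings. The function name and the centroid aggregation are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def in_out_score(query_ids, doc_ids, emb_in, emb_out):
    """Score a document for a query with IN-OUT cosine similarity:
    query words use the input embedding matrix, document words the
    output matrix. Averaging document vectors into a centroid is an
    assumption, not necessarily the paper's exact aggregation."""
    q = emb_in[query_ids]                      # (|query|, d) IN vectors
    d = emb_out[doc_ids].mean(axis=0)          # centroid of OUT vectors
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d)
    return float((q @ d).mean())               # mean IN-OUT cosine
```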
基於相依詞向量的剖析結果重估與排序 (N-best Parse Rescoring Based on Dependency-Based Word Embeddings)
Rescoring approaches for parsing aim to re-rank and change the order of the parse trees produced by a general parser for a given sentence. The re-ranking quality depends on the precision of the rescoring function. However, it is a challenge to design an appropriate function to determine the quality of parse trees. No matter which method is used, a treebank is a widely used resource in parsing tasks. ...